Machine Learning on Amazon Retail Data
  • By Bhavana

Data Prep / EDA

This page presents the data sources, the processing applied to them, and the exploratory visualizations (EDA).

Data Collection

Amazon product information was scraped through the API service ScraperAPI. Because Amazon is a hugely popular website, it has many anti-scraping measures in place, such as rate limiting, IP blocking, and dynamic loading; using an external API service avoided these limitations. The search queries were chosen from the top 100 Amazon searches, found on this site and this site. An example of using the API, along with its core endpoint, is below.

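import requests

# Query parameters for ScraperAPI's structured Amazon search endpoint
payload = {
    'api_key': 'API_KEY',          # replace with your own ScraperAPI key
    'query': 'iphone 15 charger',  # search term
    's': 'price-asc-rank'          # sort order (price, low to high)
}

# Send the request and parse the JSON body of the response
response = requests.get('https://api.scraperapi.com/structured/amazon/search',
                        params=payload).json()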

The Jupyter notebook code for the web scraping can be found here.

The scraped data was then supplemented with a second source. Since the scraped data came to only about 26K rows, a Kaggle dataset was added that contains more than one million rows, has roughly the same fields as the scraped data, and is also from the USA (many Amazon datasets on Kaggle are from outside the US).

The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.

Table 1: The raw data from both datasets.
(a) The raw data scraped from Amazon using ScraperAPI
| | type | position | asin | name | image | has_prime | is_best_seller | is_amazon_choice | is_limited_deal | stars | total_reviews | url | availability_quantity | spec | price_string | price_symbol | price | original_price | section_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | search_product | 17 | B06ZY43PDR | Amazon.com Gift Card in a Birthday Pop-Up Box | https://m.media-amazon.com/images/I/71VoEvoetO... | True | False | False | False | 4.9 | 53293.0 | https://www.amazon.com/Amazon-com-Gift-Card-Bi... | NaN | {} | $50.00$2,000.00 | $ | 50.00 | NaN | NaN |
| 1 | search_product | 48 | B0CRKFR1KX | Thanks For Being My Sister Card - Funny Annive... | https://m.media-amazon.com/images/I/71dYDteQ+2... | True | False | False | False | NaN | NaN | https://www.amazon.com/VLPGifts-Thanks-Being-S... | 2.0 | {} | $4.95 | $ | 4.95 | NaN | NaN |
| 2 | search_product | 3 | B08FP1C33H | American Greetings Rainbow Party Supplies, Mul... | https://m.media-amazon.com/images/I/71IHwT+D63... | True | False | False | False | 4.7 | 979.0 | https://www.amazon.com/American-Greetings-Mult... | NaN | {} | $7.26 | $ | 7.26 | {'price_string': '$8.49', 'price_symbol': '$',... | NaN |
| 3 | search_product | 41 | B07P43CTD4 | HandFan Portable Neck Fan, USB Rechargeable Pe... | https://m.media-amazon.com/images/I/71Y8KkiDjc... | True | False | False | False | 4.7 | 847.0 | https://www.amazon.com/HandFan-Personal-Neckla... | NaN | {} | $16.99 | $ | 16.99 | NaN | NaN |
| 4 | search_product | 3 | B0CKNJTTWY | DR770x-2ch LTE 4G Cloud Dash cam Front and Rea... | https://m.media-amazon.com/images/I/51RpBecono... | False | False | False | False | NaN | NaN | https://www.amazon.com/DR770x-2ch-Cloud-Front-... | NaN | {} | $1,056.79 | $ | 1056.79 | NaN | NaN |
(b) The raw data obtained from Kaggle
| | asin | title | imgUrl | productURL | stars | reviews | price | listPrice | category_id | isBestSeller | boughtInLastMonth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B08QC9N9G7 | Girl Wireless Gaming Headset, Cute Cat Ear Hea... | https://m.media-amazon.com/images/I/61q2tV9QNF... | https://www.amazon.com/dp/B08QC9N9G7 | 4.3 | 0 | 18.99 | 0.0 | 263 | False | 0 |
| 1 | B00DUIFDJM | Rit Dyes tan Liquid 8 oz. Bottle [Pack of 4 ] | https://m.media-amazon.com/images/I/41gRl9aGTy... | https://www.amazon.com/dp/B00DUIFDJM | 5.0 | 1 | 26.04 | 0.0 | 2 | False | 0 |
| 2 | B014QD012S | Acrylic Felt Fabric RED / 72" Wide/Sold by The... | https://m.media-amazon.com/images/I/5112oTr2k2... | https://www.amazon.com/dp/B014QD012S | 4.6 | 0 | 12.89 | 0.0 | 7 | False | 0 |
| 3 | B0977J6NX7 | Boys Short Sleeve Logo Tee Shirt (5, Heritage ... | https://m.media-amazon.com/images/I/61wlDiTgxZ... | https://www.amazon.com/dp/B0977J6NX7 | 4.1 | 8 | 19.99 | 23.0 | 84 | False | 0 |
| 4 | B09TX89V9G | Navy Blue Birthday Party Decorations Blue Conf... | https://m.media-amazon.com/images/I/81FF-W6boY... | https://www.amazon.com/dp/B09TX89V9G | 4.6 | 0 | 25.99 | 0.0 | 13 | False | 100 |

Data Cleaning

The datasets were cleaned separately and then concatenated, after which a few final cleaning steps were applied.

The steps to clean the web-scraped data were:

  • Add date_scraped column
  • Remove unnecessary columns: type, position, has_prime, is_amazon_choice, is_limited_deal, availability_quantity, spec, price_string, price_symbol, section_name
  • Expand and fix original_price
  • Rename columns to match standard snake case for merging both datasets
  • Drop rows with no asin or name or price
  • Fill NaN reviews column with 0
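
The steps above can be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook's actual code: the `date_scraped` argument, the helper names, and the target column names (`image_url`, `reviews`, `list_price`, inferred from Table 2) are assumptions, and `original_price` is assumed to hold a dict with a `price_string` key as in Table 1(a). If the data is loaded from CSV, that dict would arrive as a string and need parsing first.

```python
import pandas as pd

# Columns the text lists as unnecessary.
DROP_COLS = ["type", "position", "has_prime", "is_amazon_choice",
             "is_limited_deal", "availability_quantity", "spec",
             "price_string", "price_symbol", "section_name"]


def parse_original_price(value):
    """Return the numeric price from an original_price dict, else NaN.

    Assumes a 'price_string' entry such as '$8.49' (as seen in Table 1a).
    """
    if isinstance(value, dict) and "price_string" in value:
        return float(value["price_string"].replace("$", "").replace(",", ""))
    return float("nan")


def clean_scraped(df, date_scraped):
    """Apply the listed cleaning steps to the raw scraped frame."""
    df = df.copy()
    df["date_scraped"] = pd.to_datetime(date_scraped)
    df = df.drop(columns=DROP_COLS, errors="ignore")
    # "Expand" the nested original_price into a plain number.
    df["original_price"] = df["original_price"].map(parse_original_price)
    # Target names are assumed from the final columns shown in Table 2.
    df = df.rename(columns={"image": "image_url",
                            "total_reviews": "reviews",
                            "original_price": "list_price"})
    df = df.dropna(subset=["asin", "name", "price"]).copy()
    df["reviews"] = df["reviews"].fillna(0)
    return df
```

Calling `clean_scraped(raw_df, "2024-01-01")` on a frame shaped like Table 1(a) would return the reduced, renamed frame; the date here is only a placeholder.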

The steps to clean the Kaggle data were:

  • Add date_scraped column
  • Remove unnecessary column: boughtInLastMonth
  • Drop rows with any NaNs
  • Replace list_price values of 0 with the value of price
  • Replace category_id with the actual category name, using the category table
  • Rename columns to match standard snake case for merging both datasets

After the two datasets were concatenated, the final cleaning steps were:

  • Remove duplicates (by asin + date scraped)
  • Rename columns
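
The final steps might look like the sketch below. It assumes both frames already share snake_case column names, and that the final rename is the snake_case to Title Case mapping suggested by the headers in Table 2; which duplicate the original code kept is not stated, so keeping the first occurrence is also an assumption.

```python
import pandas as pd


def combine(scraped, kaggle):
    """Concatenate both cleaned frames, de-duplicate, and rename columns."""
    df = pd.concat([scraped, kaggle], ignore_index=True)
    # One row per product per scrape date; keeps the first occurrence.
    df = df.drop_duplicates(subset=["asin", "date_scraped"])
    # snake_case -> Title Case, e.g. 'is_best_seller' -> 'Is Best Seller'.
    df.columns = [c.replace("_", " ").title() for c in df.columns]
    return df.reset_index(drop=True)
```

De-duplicating on both asin and date keeps the same product at different scrape dates, which is what allows prices to be compared over time later on.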

The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):

Table 2: The final unioned, cleaned, and processed data.
| | Asin | Name | Image Url | Is Best Seller | Stars | Reviews | Url | Price | Date Scraped | List Price | Category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B09M8G89XT | 10Ft Micro-USB Charger Cords Cables for Samsun... | https://m.media-amazon.com/images/I/51s3ZhSfnW... | False | 4.6 | 0.0 | https://www.amazon.com/dp/B09M8G89XT | 9.99 | 2023-11-01 | 9.99 | Televisions & Video Products |
| 1 | B00112DX8M | CoverGirl Eye Enhancers 1 Kit Shadow - Snow Bl... | https://m.media-amazon.com/images/I/61xpnTBEkF... | False | 4.4 | 0.0 | https://www.amazon.com/dp/B00112DX8M | 5.92 | 2023-11-01 | 6.99 | Makeup |
| 2 | B0CBKHDNFC | 30Pcs Antique Box Corner Protectors, Decorativ... | https://m.media-amazon.com/images/I/81LuRJsM1v... | False | 0.0 | 0.0 | https://www.amazon.com/dp/B0CBKHDNFC | 11.49 | 2023-11-01 | 11.49 | Baby Safety Products |
| 3 | B004UDMDWG | SEGA INITIAL D STREET STAGE PSP the Best for P... | https://m.media-amazon.com/images/I/61l6h7YjxM... | False | 4.3 | 0.0 | https://www.amazon.com/dp/B004UDMDWG | 0.00 | 2023-11-01 | 0.00 | Sony PSP Games, Consoles & Accessories |
| 4 | B000CMJ16A | Grote 47053 - Chrome Plated Rectangular Cleara... | https://m.media-amazon.com/images/I/81tYYrhQjN... | False | 3.9 | 0.0 | https://www.amazon.com/dp/B000CMJ16A | 9.15 | 2023-11-01 | 10.42 | Heavy Duty & Commercial Vehicle Equipment |

The code for the data cleaning can be found here.

Data Preprocessing / Visualization

Several types of EDA were performed to examine the data. Note that most visuals are interactive (zoomable, pannable, etc.). The code for all visualizations can be found here.

Figure 1: A histogram of the categories of all Amazon products. Note that the scraped data had no categories; only the Kaggle data did.

Figure 2: Stars vs. number of reviews received by an Amazon product, colored by whether the product was a best seller.

Wordcloud for categories of products

Wordcloud for the names of products
Figure 3: Wordclouds (in which more frequently appearing words are larger) of the categories of products and the names of products.

Figure 4: Price vs list price of items with the same ASIN across dates scraped, with a trendline.

Because outliers in price distort the plot, the analysis was restricted to the most densely populated price range, i.e., prices below $800.
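
As a minimal sketch, the cutoff can be applied with a boolean mask. The column name `Price` follows Table 2; the function and constant names are hypothetical.

```python
import pandas as pd

PRICE_CAP = 800  # dollars; the threshold stated in the text


def drop_price_outliers(df, price_col="Price", cap=PRICE_CAP):
    """Keep only rows whose price is strictly below the cap."""
    return df[df[price_col] < cap]
```

The comparison is strict, so a product priced at exactly $800 would be excluded.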

Figure 5: Histogram of prices, colored by whether the change in price increased or decreased over time, for those items that were in both sets of data.
Figure 6: Price vs. the difference in price between the two sets of data, colored by whether the price increased or decreased.
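
The price difference behind Figures 5 and 6 could be computed along these lines. This is a sketch under assumptions: it uses the column names from Table 2, treats an item as being "in both sets" when its ASIN appears on more than one scrape date, and defines the difference as latest price minus earliest price. The actual visualization code may differ.

```python
import pandas as pd


def price_changes(df):
    """For ASINs seen on more than one date, compare first and last price."""
    grouped = df.sort_values("Date Scraped").groupby("Asin")
    out = pd.DataFrame({
        "first_price": grouped["Price"].first(),
        "last_price": grouped["Price"].last(),
        "n_dates": grouped["Date Scraped"].nunique(),
    })
    # Keep only products observed on at least two different dates.
    out = out[out["n_dates"] > 1].drop(columns="n_dates")
    out["price_diff"] = out["last_price"] - out["first_price"]
    out["direction"] = out["price_diff"].apply(
        lambda d: "increased" if d > 0 else "decreased" if d < 0 else "unchanged"
    )
    return out.reset_index()
```

The `direction` column is the kind of label that could drive the coloring in both figures.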